# heatmap to approximate which relationships are more convenient to review.
f,ax = plt.subplots(figsize = (34,20))
sns.heatmap(df.corr(),\
annot = True,\
linewidths=.5,\
fmt = '.1f',\
ax = ax,\
cbar_kws={'label': 'heat index'})\
.set_title('correlation dataset')
plt.xlabel("Values on X axis")
plt.ylabel('Values on Y axis')
Text(284.09375, 0.5, 'Values on Y axis')
#
print("shape: {}\n".format(df_final.shape))
df_final.info()
shape: (72192, 20) <class 'pandas.core.frame.DataFrame'> Int64Index: 72192 entries, 1 to 113936 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingNumber 72192 non-null int64 1 ListingCreationDate 72192 non-null datetime64[ns] 2 ListingCategory (numeric) 72192 non-null int64 3 Term 72192 non-null int64 4 BorrowerAPR 72192 non-null float64 5 BorrowerRate 72192 non-null float64 6 LenderYield 72192 non-null float64 7 EstimatedEffectiveYield 72192 non-null float64 8 EstimatedLoss 72192 non-null float64 9 EstimatedReturn 72192 non-null float64 10 ProsperRating (numeric) 72192 non-null float64 11 ProsperScore 72192 non-null float64 12 CurrentCreditLines 72192 non-null float64 13 OpenCreditLines 72192 non-null float64 14 TotalCreditLinespast7years 72192 non-null float64 15 OpenRevolvingAccounts 72192 non-null int64 16 OpenRevolvingMonthlyPayment 72192 non-null float64 17 TotalTrades 72192 non-null float64 18 Investors 72192 non-null int64 19 year 72192 non-null int64 dtypes: datetime64[ns](1), float64(13), int64(6) memory usage: 11.6 MB
df_final.describe()
| ListingNumber | ListingCategory (numeric) | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | TotalTrades | Investors | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.219200e+04 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 | 72192.000000 |
| mean | 7.749283e+05 | 3.599291 | 43.623670 | 0.229464 | 0.198834 | 0.188830 | 0.172994 | 0.081752 | 0.097199 | 4.019822 | 5.901180 | 10.574953 | 9.605053 | 28.312334 | 7.475676 | 441.042290 | 24.504225 | 68.644794 | 2012.392079 |
| std | 2.299283e+05 | 4.626467 | 12.271479 | 0.079313 | 0.073991 | 0.073991 | 0.068053 | 0.046383 | 0.029917 | 1.661331 | 2.382124 | 5.475676 | 5.084257 | 13.420174 | 4.663433 | 445.752131 | 11.768522 | 95.319251 | 1.074507 |
| min | 4.405430e+05 | 0.000000 | 12.000000 | 0.049350 | 0.040000 | 0.030000 | -0.177300 | 0.004900 | -0.177300 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2010.000000 |
| 25% | 5.662400e+05 | 1.000000 | 36.000000 | 0.169690 | 0.140000 | 0.130000 | 0.122500 | 0.044900 | 0.075170 | 3.000000 | 4.000000 | 7.000000 | 6.000000 | 19.000000 | 4.000000 | 153.000000 | 16.000000 | 1.000000 | 2012.000000 |
| 50% | 7.357605e+05 | 1.000000 | 36.000000 | 0.222620 | 0.190500 | 0.180500 | 0.164900 | 0.074900 | 0.094100 | 4.000000 | 6.000000 | 10.000000 | 9.000000 | 27.000000 | 7.000000 | 310.000000 | 23.000000 | 31.000000 | 2013.000000 |
| 75% | 9.729938e+05 | 6.000000 | 60.000000 | 0.293940 | 0.259900 | 0.249900 | 0.230570 | 0.112000 | 0.118500 | 5.000000 | 8.000000 | 13.000000 | 12.000000 | 36.000000 | 10.000000 | 576.000000 | 31.000000 | 97.000000 | 2013.000000 |
| max | 1.255725e+06 | 20.000000 | 60.000000 | 0.423950 | 0.360000 | 0.340000 | 0.319900 | 0.366000 | 0.266700 | 7.000000 | 11.000000 | 59.000000 | 54.000000 | 125.000000 | 50.000000 | 13765.000000 | 118.000000 | 866.000000 | 2014.000000 |
Here we find 113,937 loans and we will use 18 columns, most of the data is numeric.
| Column | Description |
|---|---|
| ListingNumber | The number that uniquely identifies the listing to the public as displayed on the website. |
| ListingCategory (numeric) | The category of the listing that the borrower selected when posting their listing: 0 - Not Available 1 - Debt Consolidation 2 - Home Improvement 3 - Business 4 - Personal Loan 5 - Student Use 6 - Auto 7- Other 8 - Baby&Adoption 9 - Boat 10 - Cosmetic Procedure 11 - Engagement Ring 12 - Green Loans 13 - Household Expenses 14 - Large Purchases 15 - Medical/Dental 16 - Motorcycle 17 - RV 18 - Taxes 19 - Vacation 20 - Wedding Loans |
| Term | The length of the loan expressed in months. |
| BorrowerAPR | The Borrower's Annual Percentage Rate (APR) for the loan. |
| BorrowerRate | The Borrower's interest rate for this loan. |
| LenderYield | The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. |
| EstimatedEffectiveYield | Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009. |
| EstimatedLoss | Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009. |
| EstimatedReturn | The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009 |
| ProsperRating (numeric) | The Prosper Rating assigned at the time the listing was created: 0 - N/A 1 - HR 2 - E 3 - D 4 - C 5 - B 6 - A 7 - AA Applicable for loans originated after July 2009. |
| ProsperScore | A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. |
| CurrentCreditLines | Number of current credit lines at the time the credit profile was pulled. |
| OpenCreditLines | Number of open credit lines at the time the credit profile was pulled. |
| TotalCreditLinespast7years | Number of credit lines in the past seven years at the time the credit profile was pulled. |
| OpenRevolvingAccounts | Number of open revolving accounts at the time the credit profile was pulled. |
| OpenRevolvingMonthlyPayment | Monthly payment on revolving accounts at the time the credit profile was pulled |
| TotalTrades | Number of trade lines ever opened at the time the credit profile was pulled. |
| Investors | The number of investors that funded the loan. |
# let's do a general review of the variables
df_final.hist(figsize=(20,12));
# Prosper Rating hist
bin_edges = np.arange(0, df_final['ProsperRating (numeric)'].max()+1, 1)
plt.figure(figsize=[12, 8])
plt.hist(data = df_final, x = 'ProsperRating (numeric)', bins = bin_edges)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7], ['N/A', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA'])
plt.xlabel('Prosper Rating')
plt.ylabel('Quantity')
plt.title('Prosper Rating types', fontweight="bold")
plt.show()
We could say that the Prosper Rating of the applicants in most cases is on the rise, located in the most risky, where we observe a small decrease with respect to the pattern is type A
# Prosper Score hist
bin_edges = np.arange(0, df_final['ProsperScore'].max()+1, 1)
plt.figure(figsize=[12, 8])
plt.hist(data = df_final, x = 'ProsperScore', bins = bin_edges)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'])
plt.xlabel('Prosper Score')
plt.ylabel('amount')
plt.title('Prosper Score', fontweight="bold")
plt.show()
We can the scores of the users are mostly between 5 and 9
# Estimated Loss hist
bin_edges = np.arange(0, df_final['EstimatedLoss'].max()+1, 1)
plt.figure(figsize=[12, 8])
plt.hist(data = df_final, x = 'EstimatedLoss')#, bins = bin_edges)
#plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'])
plt.xlabel('EstimatedLoss')
plt.ylabel('amount')
plt.title('Estimated Loss', fontweight="bold")
plt.show()
We observe in this graph that the estimated losses are mostly quite low.
# Estimated Return hist
bin_edges = np.arange(0, df_final['EstimatedReturn'].max()+1, 1)
plt.figure(figsize=[12, 8])
plt.hist(data = df_final, x = 'EstimatedReturn')#, bins = bin_edges)
#plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'])
plt.xlabel('EstimatedReturn')
plt.ylabel('amount')
plt.title('Estimated Return', fontweight="bold")
plt.show()
classically always the estimated return rate is higher than the loss rate
# Boxplot for estimated loss
plt.subplots(figsize=(18,10))
bp = plt.boxplot(df_final['EstimatedLoss'], vert=False, showmeans=True)
plt.title(('Estimated Loss boxplot'), fontweight="bold")
plt.xlabel('Estimated Loss')
Text(0.5, 0, 'Estimated Loss')
There are still some outliers left, but let's see the median is a little lower than the mean, and the losses are found with a mean of 0.08
For this analysis we have the majority of variables that are related to each other, so we are going to make some plots to observe the relationships
# heatmap
f,ax = plt.subplots(figsize = (34,20))
sns.heatmap(df_final.corr(),\
annot = True,\
linewidths=.5,\
fmt = '.1f',\
ax = ax,\
cbar_kws={'label': 'heat index'})\
.set_title('correlation dataset')
plt.xlabel("Values on X axis")
plt.ylabel('Values on Y axis')
Text(285.0, 0.5, 'Values on Y axis')
# Subplots for BorrowerAPR
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Scatterplot of BorrowerAPR's closest relationships")
columns = 3
rows = 2
column = ['BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn']
for i in range(1, 6):
fig.add_subplot(rows, columns, i)
sns.scatterplot(x = 'BorrowerAPR',\
y = column[i-1],\
alpha = 0.5,\
data = df_final).set_title("BorrowerAPR vs {}".format(column[i-1]))
plt.show()
As we can see a strong positive relationship between BorrowerAPR and BorrowerRate / LenderYield, but with the other variables, even though a positive relationship is maintained, it becomes less strong.
# Subplots for BorrowerRate
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Scatterplot of BorrowerAPR's closest relationships")
columns = 3
rows = 2
column = ['BorrowerAPR', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn']
for i in range(1, 6):
fig.add_subplot(rows, columns, i)
sns.scatterplot(x = 'BorrowerRate',\
y = column[i-1],\
alpha = 0.5,\
data = df_final).set_title("BorrowerRate vs {}".format(column[i-1]))
plt.show()
With the variable BorrowerRate, the strongest relationship is LenderYield then BorrowerAPR, and hence the relationship is a bit more unstable, but always positive.
# Subplots for LenderYield
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Scatterplot of LenderYield's closest relationships")
columns = 3
rows = 2
column = ['BorrowerAPR', 'BorrowerRate', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn']
for i in range(1, 6):
fig.add_subplot(rows, columns, i)
sns.scatterplot(x = 'LenderYield',\
y = column[i-1],\
alpha = 0.5,\
data = df_final).set_title("LenderYield vs {}".format(column[i-1]))
plt.show()
The LenderYield variable has the same behavior as BorrowerRate, and hence its perfect relationship.
# Subplots for EstimatedEffectiveYield
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Scatterplot of EstimatedEffectiveYield's closest relationships")
columns = 3
rows = 2
column = ['BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedLoss', 'EstimatedReturn']
for i in range(1, 6):
fig.add_subplot(rows, columns, i)
sns.scatterplot(x = 'EstimatedEffectiveYield',\
y = column[i-1],\
alpha = 0.5,\
data = df_final).set_title("EstimatedEffectiveYield vs {}".format(column[i-1]))
plt.show()
# Subplots for EstimatedLoss
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Scatterplot of EstimatedLoss's closest relationships")
columns = 3
rows = 2
column = ['BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedReturn']
for i in range(1, 6):
fig.add_subplot(rows, columns, i)
sns.scatterplot(x = 'EstimatedLoss',\
y = column[i-1],\
alpha = 0.5,\
data = df_final).set_title("EstimatedLoss vs {}".format(column[i-1]))
plt.show()
# Subplots for EstimatedReturn
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Scatterplot of EstimatedReturn's closest relationships")
columns = 3
rows = 2
column = ['BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss']
for i in range(1, 6):
fig.add_subplot(rows, columns, i)
sns.scatterplot(x = 'EstimatedReturn',\
y = column[i-1],\
alpha = 0.5,\
data = df_final).set_title("EstimatedReturn vs {}".format(column[i-1]))
plt.show()
The following variables have positive relationships with the other variables but it is a bit uneven with respect to the first visualizations.
# Subplots for ProsperRating (numeric)
fig=plt.figure(figsize=(24, 12))
fig.suptitle("Barplot of ProsperRating's closest relationships")
columns = 3
rows = 2
column = ['BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield','EstimatedReturn', 'EstimatedLoss']
for i in range(1, 7):
fig.add_subplot(rows, columns, i)
sns.barplot(x = 'ProsperRating (numeric)',\
y = column[i-1],\
data = df_final).set_title("ProsperRating vs {}".format(column[i-1]))
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'])
plt.show()
Here we see a negative relationship between ProsperRating vs BorrowerAPR / BorrowerRate / LenderYield / EstimatedEffectiveYield / EstimatedReturn / EstimatedLoss
# violin plot
plt.figure(figsize = [24, 10])
ax = sns.violinplot(data = df_final, x = 'ProsperRating (numeric)', y = 'ProsperScore')
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'])
ax.set_title("ProsperRating (numeric) vs ProsperScore")
plt.show()
In this plot we can see how the higher the ProsperRating, the higher the ProsperScore, a positive relationship but represented in another way.
As we can see in the graphs, there is a fairly close relationship between the borrower and estimated columns, this makes us know that the more one grows, the greater the other will be, which gives us greater clarity about the risks of these loans.
We also observe the interesting relationship between prosper rating and prosper score, where we observe that most of the users evaluated with high degrees are the safest to invest, and have a higher prosper score, which is good since it generates mutual security.
Here we will try to use the "year" column for the following graphs, and we will try to make some interesting plots about the loan.
# first we must go to prosper rating to letters
new_prosper_rating = {'ProsperRating (numeric)': {0:'N/A', 1:'HR', 2:'E', 3:'D', 4:'C', 5:'B', 6:'A', 7:'AA'}}
df_final.replace(new_prosper_rating, inplace = True)
df_final.head()
| ListingNumber | ListingCreationDate | ListingCategory (numeric) | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | TotalTrades | Investors | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1209647 | 2014-02-27 08:28:07.900 | 2 | 36 | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | A | 7.0 | 14.0 | 14.0 | 29.0 | 13 | 389.0 | 29.0 | 1 | 2014 |
| 3 | 658116 | 2012-10-22 11:02:35.010 | 16 | 36 | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | A | 9.0 | 5.0 | 5.0 | 29.0 | 7 | 115.0 | 26.0 | 158 | 2012 |
| 5 | 1074836 | 2013-12-14 08:26:37.093 | 1 | 60 | 0.15425 | 0.1314 | 0.1214 | 0.11567 | 0.0449 | 0.07077 | B | 10.0 | 21.0 | 17.0 | 49.0 | 13 | 1410.0 | 47.0 | 1 | 2013 |
| 7 | 768193 | 2013-05-05 06:49:27.493 | 2 | 36 | 0.23939 | 0.2019 | 0.1919 | 0.17830 | 0.0799 | 0.09840 | C | 4.0 | 6.0 | 6.0 | 10.0 | 5 | 101.0 | 10.0 | 1 | 2013 |
| 8 | 1023355 | 2013-12-02 10:43:39.117 | 7 | 36 | 0.07620 | 0.0629 | 0.0529 | 0.05221 | 0.0099 | 0.04231 | AA | 9.0 | 17.0 | 16.0 | 32.0 | 12 | 219.0 | 29.0 | 1 | 2013 |
In these plots we mostly use the Prosper rating as the third variable with respect to relationships.
sns.lmplot(x='BorrowerAPR',height = 10, y='EstimatedLoss', hue='ProsperRating (numeric)',
fit_reg=True, data=df_final)
plt.show()
As we can see in the relationship between BorrowerAPR and EstimatedLoss, the prosper rating greatly influences, since the higher the rank, the lower the estimated loss, since they are the most reliable users for loans.
plt.figure(figsize = [24, 14])
sns.pointplot(data = df_final, x = 'TotalTrades', y = 'TotalCreditLinespast7years', hue = 'ProsperRating (numeric)',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.xticks(rotation='vertical');
plt.show()
Another interesting graph about the relationship between the total trade with that of the past 7 years, we can see that it is constant and as the highest prosper ratings mostly have fewer trades
g = sns.FacetGrid(data = df_final, height = 10, col = 'year', col_wrap = 5, hue = 'ProsperRating (numeric)')
g.map(sns.countplot, "ProsperScore")
leg = plt.legend(loc = 'upper right', title = 'ProsperRating (numeric)')
plt.show()
Looking at the passing of the years, it is observed that in the more recent years the prosper rating and score relationship is stronger, and with more consistency, since the higher the rank, the higher the score.
sns.relplot(x="LenderYield",height = 20, y="EstimatedReturn", hue="ProsperRating (numeric)",
kind="scatter", data=df_final);
This last graph is another example of how the higher the prosper rating, the lower the loss and the credit payment.
As we observe in the plots, the prosper rating and score are fundamental to deduce the increase or decrease of the other loan relationships that have to do with the borrower and estimates of return or loss, and even though the column of years is not decisive, it can be see with interest the behavior of the other variables.
There are few qualitative relationships, beyond the prosper rating, since there are still other similar columns that do not keep much correlation with the other variables.
I think that the information provided about the loans is not very enriching, and I tried to understand a little from the outside but I do not know if it captures well the possible concerns to be raised.
Too many variables, I took the ones that I saw that had a better relationship between them, but maybe someone more expert in loans could make a more successful decision.